Abstract: Multilabel classification deals with problems in 
which an instance can belong to multiple classes. This 
paper uses error-correcting codes for multilabel 
classification. The proposed method combines the BCH code 
with the random forests learner, so that the 
error-correcting properties of BCH are merged with the 
good performance of the random forests learner to 
enhance multilabel classification results. Three 
experiments are conducted on three common benchmark 
datasets. The results are compared against those of 
several existing approaches. The proposed method 
compares well against its counterparts on the three 
datasets of varying characteristics. 

Index Terms: multilabel data, multilabel classification, 
error correcting codes, BCH code, ensemble learners, 
random forests 

I. Introduction 

Single-label classification refers to learning from a 
collection of instances, each of which is related to only 
one label l from a set of labels L. Multilabel 
classification, on the other hand, refers to learning from 
a set of instances, each of which is associated with a set 
of labels Y ⊆ L [1]. A sample multilabel dataset is shown 
in Table I. It consists of three instances. Each instance 
contains four features. There are three classes, 
L = {C1, C2, C3}. Each instance belongs to one class or 
multiple classes. 





TABLE I.

Sample Multilabel Dataset

Inst   Features               C1   C2   C3
1      F11  F12  F13  F14     -    1    1
2      F21  F22  F23  F24     1    1    -
3      F31  F32  F33  F34     -    -    1



Multilabel classification has received attention in the 
past several years. A number of methods have been 
developed to tackle multilabel classification problems. 
Some key methods are reviewed in the following. 

According to Brinker et al. [2], Binary Relevance 
(BR) treats the prediction of each label as an 
independent binary classification task. It trains a 
separate binary relevance model for each possible label, 
using all examples related to the label as positive 
examples and all other examples as negative examples. 
To classify a new instance, all binary predictions are 
obtained, and the set of labels with a positive relevance 
classification is then associated with the instance. 
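
As a minimal sketch of the BR decomposition just described, the snippet below turns a list of label sets into one binary target vector per label and recombines per-label decisions into a label set. The label sets are illustrative (loosely modelled on Table I), and the function names are hypothetical, not taken from [2].

```python
LABELS = ["C1", "C2", "C3"]

# Illustrative label sets for three instances (an assumption for this sketch).
label_sets = [{"C2", "C3"}, {"C1", "C2"}, {"C3"}]

def binary_relevance_targets(label_sets, labels):
    """Build one 0/1 target vector per label: 1 if the instance carries the label."""
    return {label: [1 if label in s else 0 for s in label_sets] for label in labels}

def combine_predictions(binary_predictions):
    """Collect the labels whose binary classifier voted positive."""
    return {label for label, bit in binary_predictions.items() if bit == 1}

targets = binary_relevance_targets(label_sets, LABELS)
predicted = combine_predictions({"C1": 0, "C2": 1, "C3": 1})
```

Each vector in `targets` would be handed to a separate binary learner; any single-label classifier can play that role.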

Zhang and Zhou [3] report a multi-label lazy 
learning approach which is derived from the traditional 
k-Nearest Neighbour (kNN) and named ML-kNN. For 
each unseen instance, its k nearest neighbors in the 
training set are identified. Then, based on statistical 
information gained from the label sets of these 
neighboring instances, the maximum a posteriori 
principle is used to determine the label set for the 
unseen instance. 

Zhang and Zhou [4] present a neural network-based 
algorithm named Backpropagation for Multi-Label 
Learning (BP-MLL). It is based on the backpropagation 
algorithm but uses a specific error function that 
captures the characteristics of multi-label learning: 
the labels belonging to an instance should be ranked 
higher than those not belonging to that instance. 

Tsoumakas and Vlahavas [5] propose RAndom K-labELsets 
(RAKEL), an ensemble method for multilabel 
classification based on random projections of the label 
space. An ensemble of Label Powerset (LP) classifiers is 
trained, each on a small label subset selected at 
random. RAKEL takes label correlations into account by 
using single-label classifiers that are applied to 
subtasks with a manageable number of labels and an 
adequate number of examples per label. It therefore 
tackles the difficulty of learning from a large number 
of classes associated with only a few examples. 

Several other important works can also be found in 
[6-11]. The main motivation behind the work reported 
in this paper is our desire to improve the performance 
of multilabel classification methods. This paper 
explores the use of an error-correcting code for 
multilabel classification. It uses the Bose, 
Ray-Chaudhuri, Hocquenghem (BCH) code and the random 
forests learner to form a method that can deal with 
multilabel classification problems and improve on the 
performance of several popular existing methods. The 
theoretical framework and the proposed method are 
described in the following sections. 

II. Bose, Ray-Chaudhuri, Hocquenghem Code 

The Bose, Ray-Chaudhuri, Hocquenghem (BCH) code is 
a multilevel, cyclic, error-correcting, variable-length 
digital code that can correct errors up to about 25% of 
the total number of digits [12-13]. The original 
applications of the BCH code were limited to binary 
codes of length $2^m - 1$ for some integer m. These were 
later extended to nonbinary codes with symbols from the 
Galois field GF(q). A Galois field is a field with a 
finite field order (number of elements). The order of a 
Galois field is always a prime or a power of a prime 
number. When q is prime, GF(q) is called the prime field 
of order q, and its q elements are 0, 1, ..., q-1. 

ACEEE International Journal on Network Security, Vol 1, No. 2, July 2010 
©2010 ACEEE 
DOI: 01.ijns.01.02.03 

BCH codes are cyclic codes and can be specified by 
a generator polynomial. For any integer $m \ge 3$ and 
$t < 2^{m-1}$, there exists a primitive BCH code with the 
following parameters: 

$n = 2^m - 1$ 

$n - k \le mt$        (1) 

$d_{min} \ge 2t + 1$ 

The code can correct t or fewer random errors over a 
span of $2^m - 1$ bit positions. The code is called a 
t-error-correcting BCH code over GF(q) of length n. This 
code is specified as follows: 

1. Determine the smallest m such that GF($q^m$) has 
a primitive nth root of unity $\beta$. 

2. Select a nonnegative integer b. Frequently, 
b = 1. 

3. Form a list of 2t consecutive powers of $\beta$: 
$\beta^b, \beta^{b+1}, \ldots, \beta^{b+2t-1}$. Determine 
the minimal polynomial with respect to GF(q) of each of 
these powers of $\beta$. 

4. The generator polynomial g(x) is the least 
common multiple (LCM) of these minimal 
polynomials. The code is an (n, n - deg(g(x))) 
cyclic code. 

Because the generator is constructed using minimal 
polynomials with respect to GF(q), the generator g(x) 
has coefficients in GF(q), and the code is over GF(q). 
Two fields are involved in the construction of BCH 
codes. GF(q) is the field where the generator polynomial 
has its coefficients and where the elements of the 
codewords lie. GF($q^m$) is the field where the generator 
polynomial has its roots. For encoding purposes, it is 
adequate to work only with GF(q). However, decoding 
requires operations in GF($q^m$). 

For binary BCH codes, let $\alpha$ be a primitive element 
in GF($2^m$). For $1 \le i \le t$, let $\phi_{2i-1}(x)$ be 
the minimal polynomial of the field element 
$\alpha^{2i-1}$. The degree of $\phi_{2i-1}(x)$ is m or a 
factor of m. The generator polynomial g(x) of 
t-error-correcting BCH codes of length $2^m - 1$ is 
given by: 

$g(x) = \mathrm{LCM}\{\phi_1(x), \phi_3(x), \ldots, \phi_{2t-1}(x)\}$        (2) 
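
To make the construction in (2) concrete, the following is a minimal sketch that builds GF($2^m$) from a primitive polynomial, computes the minimal polynomials of the odd powers of $\alpha$, and multiplies the distinct ones to obtain g(x). The choice m = 4 with primitive polynomial $x^4 + x + 1$ is an assumption made for illustration, and all function names are hypothetical.

```python
def gf_tables(m, prim_poly):
    """Exp/log tables for GF(2^m); elements are ints whose bits are coefficients."""
    n = (1 << m) - 1
    exp, log = [0] * (2 * n), [0] * (n + 1)
    x = 1
    for i in range(n):
        exp[i], log[x] = x, i
        x <<= 1
        if x >> m:                 # degree reached m: reduce modulo prim_poly
            x ^= prim_poly
    for i in range(n, 2 * n):      # duplicate so exp[log a + log b] never overflows
        exp[i] = exp[i - n]
    return exp, log, n

def gf_mul(a, b, exp, log):
    return 0 if a == 0 or b == 0 else exp[log[a] + log[b]]

def minimal_poly(j, exp, log, n):
    """Minimal polynomial of alpha^j over GF(2), lowest-degree coefficient first."""
    coset, e = [], j % n
    while e not in coset:          # conjugates alpha^j, alpha^(2j), alpha^(4j), ...
        coset.append(e)
        e = (2 * e) % n
    poly = [1]
    for e in coset:                # multiply by (x + alpha^e); + equals - here
        new = [0] * (len(poly) + 1)
        for d, c in enumerate(poly):
            new[d + 1] ^= c
            new[d] ^= gf_mul(c, exp[e], exp, log)
        poly = new
    return poly, frozenset(coset)

def poly_mul_gf2(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                out[i + j] ^= bj
    return out

def bch_generator(m, t, prim_poly):
    """Return (n, k, g) for the binary t-error-correcting BCH code of length 2^m - 1."""
    exp, log, n = gf_tables(m, prim_poly)
    g, seen = [1], set()
    for i in range(1, t + 1):
        poly, coset = minimal_poly(2 * i - 1, exp, log, n)
        if coset not in seen:      # LCM: use each distinct minimal polynomial once
            seen.add(coset)
            g = poly_mul_gf2(g, poly)
    return n, n - (len(g) - 1), g

n, k, g = bch_generator(4, 2, 0b10011)   # g(x) = 1 + x^4 + x^6 + x^7 + x^8
```

With these assumed parameters the sketch gives the (15, 7) code, whose 8 check bits meet the bound $n - k \le mt$ in (1) with equality.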

The first explicit decoding algorithm for binary BCH 
codes was Peterson's algorithm, which was useful only 
for correcting small numbers of errors. Berlekamp 
introduced the first truly efficient decoding algorithm 
for both binary and nonbinary BCH codes. This was 
further developed by Massey and is usually called the 
Berlekamp-Massey decoding algorithm. 

Consider a BCH code with $n = 2^m - 1$ and generator 
polynomial g(x). Suppose a code polynomial 
$c(x) = c_0 + c_1 x + \ldots + c_{n-1} x^{n-1}$ is 
transmitted. Let 
$r(x) = r_0 + r_1 x + \ldots + r_{n-1} x^{n-1}$ be the 
received polynomial. Then r(x) = c(x) + e(x), where e(x) 
is the error polynomial. To check whether r(x) is a code 
polynomial, the condition 
$r(\alpha) = r(\alpha^2) = \ldots = r(\alpha^{2t}) = 0$ 
is tested. If it holds, then r(x) is a code polynomial; 
otherwise r(x) is not a code polynomial and the presence 
of errors is detected. The decoding procedure includes 
three steps: syndrome calculation, error pattern 
specification, and error correction. 
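
The syndrome test above can be sketched for a small binary code. The example assumes the (15, 7) BCH code with t = 2, field GF($2^4$) built from $x^4 + x + 1$, and generator $g(x) = 1 + x^4 + x^6 + x^7 + x^8$; encoding by plain polynomial multiplication is also an assumption, chosen for brevity. It covers only the first decoding step (syndrome calculation), not error location or correction.

```python
PRIM_POLY = 0b10011                 # x^4 + x + 1 (assumed primitive polynomial)
M, T = 4, 2
N = (1 << M) - 1                    # code length n = 2^m - 1 = 15
G = [1, 0, 0, 0, 1, 0, 1, 1, 1]     # g(x), lowest-degree coefficient first

# EXP[i] holds alpha^i as an integer bit pattern.
EXP = []
_x = 1
for _ in range(N):
    EXP.append(_x)
    _x <<= 1
    if _x >> M:
        _x ^= PRIM_POLY

def encode(msg):
    """Non-systematic encoding: c(x) = m(x) g(x) over GF(2)."""
    code = [0] * N
    for i, mi in enumerate(msg):
        if mi:
            for j, gj in enumerate(G):
                code[i + j] ^= gj
    return code

def syndromes(received):
    """Evaluate r(alpha^i) for i = 1..2t; all zeros means r(x) is a code polynomial."""
    out = []
    for i in range(1, 2 * T + 1):
        s = 0
        for pos, bit in enumerate(received):
            if bit:
                s ^= EXP[(i * pos) % N]   # add alpha^(i*pos); field addition is XOR
        out.append(s)
    return out

codeword = encode([1, 0, 1, 1, 0, 0, 1])
corrupted = list(codeword)
corrupted[5] ^= 1                   # inject a single bit error
```

Here `syndromes(codeword)` is all zeros, while `syndromes(corrupted)` is not, which is how the presence of errors is detected before the error pattern is located.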

III. Random Forests 

Ensemble learning refers to algorithms that produce 
collections of classifiers which learn to classify by 
training individual learners and fusing their 
predictions. Growing an ensemble of trees and letting 
them vote for the most popular class has given a good 
enhancement in the accuracy of classification. Random 
vectors are built that control the growth of each tree in 
the ensemble. Ensemble learning methods can be 
divided into two main groups: bagging and boosting. 

In bagging, models are fit in parallel, and 
successive trees do not depend on previous trees. Each 
tree is built independently using a bootstrap sample of 
the dataset. A majority vote determines the prediction. 
In boosting, models are fit sequentially, and successive 
trees assign additional weight to those observations 
poorly predicted by previous models. A weighted vote 
specifies the prediction. 

A random forest [14] adds an additional degree of 
randomness to bagging. In addition to constructing each 
tree from a different bootstrap sample of the dataset, it 
changes the method by which the classification trees are 
built. 

A random forest predictor is an ensemble of 
individual classification tree predictors. For each 
observation, each individual tree votes for one class 
and the forest predicts the class that has the plurality of 
votes. The user has to specify the number of randomly 
selected variables $m_{try}$ to be searched through for 
the best split at each node. Whilst a node is split using 
the best split among all variables in standard trees, in a 
random forest the node is split using the best among a 
subset of predictors randomly chosen at that node. The 
largest tree possible is grown and is not pruned. The 
root node of each tree in the forest contains a bootstrap 
sample from the original data as the training set. The 
observations that are not in the training set are referred 
to as "out-of-bag" observations. 

Since an individual tree is unpruned, the terminal 
nodes can contain only a small number of observations. 
The training data are run down each tree. If 
observations i and j both end up in the same terminal 
node, the similarity between i and j is increased by one. 
At the end of the forest construction, the similarities are 
symmetrised and divided by the number of trees. The 
similarity between an observation and itself is set to 
one. The similarities between objects form a matrix 
which is symmetric, and each entry lies in the unit 
interval [0, 1]. Breiman defines the random forest as 
[14]: 

A random forest is a classifier consisting of a 
collection of tree-structured classifiers 
$\{h(x, \Theta_k), k = 1, \ldots\}$ where the 
$\{\Theta_k\}$ are independent identically distributed 
random vectors and each tree casts a unit vote for the 
most popular class at input x. 

Fig. 1 displays pseudo-code for the random forest 
algorithm. A summary of the random forest algorithm 
for classification is given below [15]: 

• Draw K bootstrap samples from the training 
data. 

• For each of the bootstrap samples, grow an 
unpruned classification tree, with the following 
modification: at each node, rather than 
choosing the best split among all predictors, 
randomly sample m of the predictors and 
choose the best split from among those 
variables. 

• Predict new data by aggregating the predictions 
of the K trees, i.e., majority votes for 
classification, average for regression. 



• select the number of trees to be generated, K 
• for (k = 1; k <= K; k++) 
    • draw a bootstrap sample from the training data 
    • grow an unpruned classification tree h(x, Θ_k) 
      for (i = 1; i <= number-of-nodes; i++) 
        • randomly sample m predictor variables 
        • select the best split from among those variables 
      end 
  end 
• each of the K classification trees casts 1 vote for the most popular class at input x 
• aggregate the classifications of the K trees and select the class with maximum votes 

Figure 1. Pseudo-code for the random forest algorithm. 
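
The pseudo-code in Figure 1 can be sketched in plain Python as follows. This is an illustrative, standard-library-only sketch under several assumptions: Gini impurity as the split criterion, midpoint thresholds on numeric features, integer class labels, and a tiny made-up dataset. It is not the implementation used in the experiments, and all names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rows, features):
    """Best (feature, threshold) among the sampled features by weighted Gini."""
    best, best_score = None, float("inf")
    for f in features:
        vals = sorted({X[r][f] for r in rows})
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2.0
            left = [r for r in rows if X[r][f] <= thr]
            right = [r for r in rows if X[r][f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini([y[r] for r in left])
                     + len(right) * gini([y[r] for r in right])) / len(rows)
            if score < best_score:
                best, best_score = (f, thr, left, right), score
    return best

def grow_tree(X, y, rows, m, rng):
    """Grow an unpruned tree; internal nodes are tuples, leaves are class labels."""
    labels = [y[r] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]
    features = rng.sample(range(len(X[0])), m)   # m random predictors at this node
    split = best_split(X, y, rows, features)
    if split is None:                            # sampled features cannot split
        return Counter(labels).most_common(1)[0][0]
    f, thr, left, right = split
    return (f, thr, grow_tree(X, y, left, m, rng), grow_tree(X, y, right, m, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node

def random_forest(X, y, K=25, m=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(K):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        forest.append(grow_tree(X, y, rows, m, rng))
    return forest

def predict_forest(forest, x):
    """Each tree casts one vote; the class with the most votes wins."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]

# Made-up, well-separated two-class data for illustration only.
X_demo = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
          [1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [0.7, 0.8]]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(X_demo, y_demo, K=25, m=1, seed=0)
```

The two parameters exposed here, `K` and `m`, correspond to the number of trees and the number of variables tried at each split.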

The random forest approach works well because of: 
(i) the variance reduction achieved through averaging 
over learners, and (ii) randomised stages decreasing 
correlation between distinct learners in the 
ensemble. The generalisation error of a forest of tree 
classifiers depends on the strength of the individual 
trees in the forest and the correlation between them. 
Using a random selection of features to split each node 
yields error rates that compare favourably to AdaBoost 
[16]. An estimate of the error rate can be obtained, 
based on the training data, by the following [15]: 

• At each bootstrap iteration, predict the data that 
are not in the bootstrap sample, called "out-of-bag" 
data, using the tree grown with the bootstrap sample. 

• Aggregate the out-of-bag predictions. On average, 
each data point would be out-of-bag around 36.8% [17] 
of the time. Calculate the error rate, and call it the 
"out-of-bag" estimate of the error rate. 

With regard to the 36.8%, the random forest forms a 
set of tree-based learners. Each learner gets a different 
training set of n instances extracted independently with 
replacement from the learning set. The bootstrap 
replication of training instances is not the only source 
of randomness. In each node of the tree the splitting 
attribute is selected from a randomly chosen sample of 
attributes. As the training sets of individual trees are 
formed by bootstrap replication, on average a fraction 
$1/e \approx 36.8\%$ of the instances does not take part 
in the construction of each tree [17]. The random forest 
performs well compared to some other popular 
classifiers. Also, it has only two parameters to adjust: 
(i) the number of variables in the random subset at each 
node, and (ii) the number of trees in the forest. It learns 
fast. 
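
The 36.8% figure follows from the fact that a given instance is missed by all n draws of a bootstrap sample with probability $(1 - 1/n)^n$, which tends to $1/e$ as n grows. A minimal numerical check, with hypothetical function names and arbitrary sample sizes, is:

```python
import math
import random

def prob_out_of_bag(n):
    """Probability that a given instance is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

def simulated_oob_fraction(n, trials, seed=0):
    """Average fraction of instances left out of a bootstrap sample, by simulation."""
    rng = random.Random(seed)
    missing = 0
    for _ in range(trials):
        drawn = {rng.randrange(n) for _ in range(n)}   # distinct instances drawn
        missing += n - len(drawn)
    return missing / (n * trials)

analytic = prob_out_of_bag(1500)
simulated = simulated_oob_fraction(200, 200)
limit = math.exp(-1)        # 1/e, approximately 0.368
```

Both `analytic` and `simulated` land close to `limit`, so roughly a third of the training instances are available to each tree as out-of-bag test data.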

IV. Proposed Method 

This paper explores the utilization of an 
error-correcting code and the random forests learner for 
multilabel classification. The proposed method is called 
MultiLabel Bose, Ray-Chaudhuri, Hocquenghem 
Random Forests (ML-BCHRF). The block diagram 
description of the ML-BCHRF is shown in Fig. 2. 

The method first transforms the set of labels L using 
the BCH encoding algorithm. For a k-class dataset, the 
set of labels associated with an instance, which contains 
k binary values, is treated as a message word and is 
transformed into an n-bit binary word, where n > k. The 
n-bit binary word is called the encoded message. Then, 
the multilabel classification problem is decomposed 
into n binary classification problems. Next, n random 
forests classifiers are developed, one for each binary 
class. After that, the n classification decisions of the n 
binary classifiers are transformed using the BCH 
decoding algorithm, and k binary values are again 
obtained. In this way, the error-correcting properties of 
the BCH code are incorporated into the system, which 
helps correct possible misclassifications by some of the 
n individual binary classifiers. 
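
The encode, decompose, train, and decode steps above can be sketched end to end. Several parts are stand-ins and should be read as assumptions: a small (15, 7) BCH code (t = 2) replaces the larger codes used in the experiments, decoding is done by exhaustive nearest-codeword search instead of a BCH decoding algorithm, and a trivial nearest-neighbour learner takes the place of the random forest so that the example needs no external library. The `flip` argument is purely illustrative and simulates wrong decisions by some binary classifiers.

```python
from itertools import product

K_BITS, N_BITS = 7, 15
G = [1, 0, 0, 0, 1, 0, 1, 1, 1]     # g(x) of the assumed (15, 7) BCH code

def bch_encode(msg):
    """Encode a k-bit label set as c(x) = m(x) g(x) over GF(2)."""
    code = [0] * N_BITS
    for i, mi in enumerate(msg):
        if mi:
            for j, gj in enumerate(G):
                code[i + j] ^= gj
    return code

# Codeword -> message lookup, feasible only because k is tiny here.
CODEBOOK = {tuple(bch_encode(m)): list(m) for m in product([0, 1], repeat=K_BITS)}

def bch_decode(bits):
    """Nearest-codeword decoding by exhaustive search (stand-in for a BCH decoder)."""
    nearest = min(CODEBOOK, key=lambda c: sum(a != b for a, b in zip(c, bits)))
    return CODEBOOK[nearest]

class NearestNeighbour:
    """Placeholder binary learner; the proposed method uses a random forest here."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]

def train(X, Y, make_learner=NearestNeighbour):
    """Train one binary classifier per bit of the encoded label set."""
    encoded = [bch_encode(labels) for labels in Y]
    return [make_learner().fit(X, [row[j] for row in encoded]) for j in range(N_BITS)]

def classify(models, x, flip=()):
    """Collect n binary decisions, optionally corrupt some, then decode to k bits."""
    bits = [model.predict(x) for model in models]
    for j in flip:
        bits[j] ^= 1
    return bch_decode(bits)

# Made-up training data: three instances with 7-bit label sets.
X_train = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
Y_train = [[1, 0, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 1, 0], [0, 0, 0, 0, 1, 0, 1]]
models = train(X_train, Y_train)
clean = classify(models, [0.9, 1.0])
repaired = classify(models, [0.9, 1.0], flip=(2, 11))
```

Because this code has minimum distance at least 5, two flipped decisions still decode to the same label set, which is the effect the method relies on.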

TRAIN 

  BCH-encode k-bit label set into n-bit encoded label set 
  Decompose multilabel classification into single-label binary classification 
  Train n random forests binary classifiers using instances and encoded label set 

TEST 

  Present new instance to n random forests binary classifiers 
  BCH-decode n-bit classification decision into k-bit label set 

Figure 2. Block diagram description of the proposed ML-BCHRF 
method. 

For classification of a new instance, its features are 
independently presented to the n binary classifiers. Then 
the n classification decisions of the n binary classifiers 
are transformed into k binary values using the BCH 
decoding algorithm. Error correction is applied during 
this transformation, which helps correct possible 
misclassifications by some of the n individual binary 
classifiers. Each bit of the k resultant binary values 
that is '1' indicates that the instance belongs to the 
associated class. 

V. Experimental Results 

To evaluate ML-BCHRF, its performance was 
compared against a number of existing methods on three 
different datasets. These are among the popular 
benchmark datasets for multi-label classification. Their 
characteristics are presented in Table II. 

The ML-BCHRF employs the random forest learner 
as its base classifier. The random forest learner has two 
important parameters, called number-of-trees-to-grow 
and number-of-variables-at-each-split, that can be 
varied to obtain the best forest for the specific training 
data. 

TABLE II. 

Characteristics of Benchmark Datasets 

Dataset     Features   Classes   Train   Testing 
Scene       294        6         1211    1196 
Yeast       103        14        1500    917 
Mediamill   120        101       30993   12914 

In the first experiment, we trained and tested the 
ML-BCHRF on the Yeast dataset. The (63, 16) BCH 
encoder was used and two dummy bits were added to 
the label set to extend it to 16 binary bits. We used 
number-of-trees-to-grow = 60 and 
number-of-variables-at-each-split = 20. However, these 
two parameters can be varied to achieve better results. 
The ML-BCHRF results were compared against those of 
MMP [6], AdaBoost.HM [7], ADTBoost.HM [8], LP 
[5], BR [2], RankSVM [9], ML-KNN [3], RAKEL [5], 
1vsAll SVM [10], and ML-PC [11] found in the 
literature. Table III shows the experimental results for 
the Yeast dataset. 

TABLE III. 

Results for Yeast 

Method        Hamming Loss 
MMP           0.297 
ADTBoost.HM   0.215 
AdaBoost.HM   0.210 
LP            0.202 
BR            0.199 
RankSVM       0.196 
ML-KNN        0.195 
RAKEL         0.193 
1vsAll SVM    0.191 
ML-PC         0.189 
ML-BCHRF      0.188 
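
The results are reported as Hamming loss, which this section does not define. Under the usual definition (an assumption here, since the text gives no formula), it is the fraction of instance-label pairs that are predicted incorrectly, so lower is better. A minimal sketch with made-up label vectors:

```python
def hamming_loss(true_sets, predicted_sets):
    """Fraction of instance-label pairs whose predicted bit differs from the true bit."""
    total = wrong = 0
    for truth, pred in zip(true_sets, predicted_sets):
        total += len(truth)
        wrong += sum(a != b for a, b in zip(truth, pred))
    return wrong / total

# Made-up example: 2 instances, 3 labels each, 2 of the 6 bits are wrong.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```

In this toy case `loss` is 2/6, since two of the six label decisions disagree with the ground truth.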



In the second experiment, we trained and tested the 
ML-BCHRF on the Scene dataset. The (31, 6) BCH 
encoder was used. We used number-of-trees-to-grow = 
60 and number-of-variables-at-each-split = 20. The 
ML-BCHRF results were compared against those of 
BR [2], LP [5], and RAKEL [5] found in the literature. 
Table IV shows the experimental results for the Scene 
dataset. 

In the third experiment, we trained and tested the 
ML-BCHRF on the Mediamill dataset. The (255, 107) 
BCH encoder was used and six dummy bits were added 
to the label set to extend it to 107 binary bits. We used 
number-of-trees-to-grow = 60 and 
number-of-variables-at-each-split = 20. The ML-BCHRF 
results were compared against those of LP [5], BR [2], 
ML-KNN [3], and RAKEL [5] found in the literature. 
Table V shows the experimental results for the 
Mediamill dataset. 

TABLE IV. 

Results for Scene 

Method      Hamming Loss 
BR          0.114 
LP          0.099 
RAKEL       0.095 
ML-BCHRF    0.074 



The experimental results show that ML-BCHRF has 
performed better than its reported counterparts, 
achieving the lowest Hamming loss on all three datasets 
of varying characteristics. The reason for the 
demonstrated performance relates to the mixture of: 
(i) the error correcting capability of the BCH code, and 
(ii) the superior performance of the random forest 
learner. 



TABLE V. 

Results for Mediamill 

Method      Hamming Loss 
LP          0.046 
BR          0.038 
ML-KNN      0.031 
RAKEL       0.030 
ML-BCHRF    0.028 



VI. Conclusion 

A method was proposed that BCH-encodes the label sets 
and then decomposes the problem into binary 
classification. One random forests classifier is 
developed for each binary class. The classification 
decisions are then BCH-decoded, and k binary values are 
again obtained. The experimental results show that the 
proposed method has performed better than its reported 
counterparts. Future work will include experiments in 
which the parameters of the random forests learner are 
varied to achieve better results. 